Abstract

Recurrent neural networks are powerful tools for handling incomplete data problems in
computer vision, thanks to their significant generative capabilities. However, the
computational demand of these algorithms is too high for real-time operation without
specialized hardware or software solutions. In this paper, we propose a framework for
augmenting a feedforward network with recurrent processing capabilities without sacrificing
much computational efficiency. We assume a mixture model and generate samples of the last
hidden layer according to the class decisions of the output layer, modify the hidden layer
activity using the samples, and propagate to lower layers. For the visual occlusion problem,
the iterative procedure emulates a feedforward-feedback loop, filling in the missing hidden
layer activity with meaningful representations. The proposed algorithm is tested on a
widely used dataset and is shown to achieve a 2x improvement in classification accuracy
for occluded objects. When compared to Restricted Boltzmann Machines, our algorithm
shows superior performance for occluded object classification.

1. Introduction


In many applications across a wide variety of fields, the data to be processed can be
partially affected by severe noise at several stages, e.g., occlusions during a visual
recording, a scratch on a compact disc, or packet losses during transmission over a
communication channel (Figure 1). Missing data problems have been extensively studied for
various purposes (see [1] for a recent review). Such problems often severely degrade
the performance of the target application, for instance pedestrian detection under
occlusion [2] or face recognition [3].

Classification of objects under occlusion is an important problem in computer vision
that has previously been tackled from both the cognitive [4, 5, 6] and computational
perspectives [7, 8, 9, 10]. The proposed solutions are in the domain of inference with
incomplete data [11], generative models for classification problems [12], and
neural networks.

Figure 1: The incomplete data problem arises when reading from a scratched disc or
recognizing objects under occlusion. Adapted from www.f7sound.com and imagebank.osa.org


In the general incomplete data classification setting, the missing data attributes are
known, but this is not the case for visual occlusions: detecting and localizing the
occluded region is not an easy task. In this paper, a novel algorithm based on mixture
models [?], multiple [11] and K-Nearest Neighbor imputation [14] is proposed. In our
algorithm, neural network [15] hidden layer activities are imputed in a recurrent
architecture. The algorithm is tested on occluded object classification, where the
occlusion pattern is completely unknown and its effect on the hidden layer activity is
distributed. Our approach does not attempt to localize the occlusion but rather to
mitigate its effect on classification.

The proposed multiple imputation approach can be considered a form of pseudo-recurrent
processing in the network, as if it were governed by the dynamical equations of the
corresponding hidden layer activities. This framework provides a shortcut to the feedback
computation, which makes it suitable for real-time operation. The experiments show that
the proposed algorithm improves classification performance significantly. When compared
to Restricted Boltzmann Machines, our algorithm shows much better performance, and
we conclude that energy based recurrent neural networks seem to be beneficial only when
the occlusion is successfully localized. In the following, after discussing the related work
in the literature, we describe the details of the proposed algorithm and present results
on a modified CIFAR 10 dataset [16].


1.1. Recurrent Neural Networks 

Recurrent Neural Networks (RNNs) are connectionist computational models that utilize
distributed representations and the nonlinear dynamics of their units. Information in RNNs
is propagated and processed in time through the states of their hidden units, which makes
them appropriate tools for sequential information processing. There are two broad types
of RNNs: stochastic energy based RNNs with symmetric connections, and deterministic
ones with directed connections.

RNNs are known to be Turing complete computational models [17] and universal
approximators of dynamical systems [18]. They are especially powerful tools for dealing
with long-range statistical relationships in a wide variety of applications, ranging from
natural language processing to financial data analysis.

Despite their immense potential as universal computers, difficulties in training RNNs
arise due to the inherent problems of learning long-term dependencies [19, 20, 21] and
convergence issues [22]. However, recent advances suggest promising approaches to
overcoming these issues, such as using better nonlinear optimizers [23, 24], adopting hybrid
strategies [25] or utilizing a reservoir of coupled oscillators [26, 27]. Nevertheless, RNNs
remain computationally expensive in both the training and the test phases.

RNNs have been shown to be very successful generative models for data completion [28], and
specifically in vision [?]. Although feedforward neural networks [29, 30, 31],
auto-associative neural networks [32] and self-organizing maps [33] are also used for data
completion tasks, recurrent neural networks are a more natural choice due to their strong
generative capabilities. Restricted Boltzmann Machines (RBMs) have been shown to fill in
the missing pixels due to visual occlusions very successfully [?], but their classification
performance was not measured on standard datasets.

The idea in this paper is to imitate recurrent processing in a feedforward network
and exploit its power while avoiding the expensive energy minimization in training and the
computationally heavy sampling at test time. More importantly, in contrast to [?], our
study assumes that the location of the occlusion is unknown, which is the case in most
real-life applications (e.g. pedestrian detection [2]).


1.2. Classification with Incomplete Data 

Classification and clustering with missing data is a well-studied problem in the machine
learning literature [?] (see [1] for a review). The corresponding studies, such as
[13, 37], are related to inference with incomplete data [11] and generative models [12],
where Bayesian frameworks [?] are used for inference under missing data conditions.
Alternatively, pseudo-likelihood [38] and dependency network [39] approaches solve the
data completion problem by learning conditional distributions. On the other hand,
imputation is commonly used as a pre-processing tool [36]. The Mixture of Factor
Analyzers [?] approach replaces the missing attributes with samples drawn from a
parametric density, which models the distribution of the underlying true data.
K-Nearest Neighbor imputation [14] has been shown to be a very effective method [41]
despite its simplicity.

Sampling from a mixture of factor analyzers [?] or from the whole dataset [14] and filling
in the missing data attributes is effectively very similar to inserting feedback
information in a neural network from a higher layer of neurons onto a lower layer of
neurons. In this paper, imputation is used as a part of the pseudo-recurrent processing.
The novelty of our framework is that, instead of imputing the missing data, the neural
network hidden layer representation of the incomplete data is imputed in an iterative
fashion. In our approach, a feedforward neural network makes a class decision at its
output layer and, based on this decision, selects an appropriate density from which to
estimate the selected model's hidden layer activities. After this sampling stage, the
algorithm inserts the estimated sample (by weighted averaging) as if it were feedback from
a higher layer. This procedure is repeated multiple times to emulate the
feedforward-feedback iterations in an RNN. Other related concepts such as multi-hypotheses
feedback and winner-takes-all are also implemented to examine their role in this
pseudo-recurrent processing. We suggest this approach as a real-time operable alternative
to recurrent processing in neural networks.


1.3. Visual Occlusions 


Data completion is essential in image processing for handling corrupted images.
Generally, a corrupted image is restored by explicitly learning the image statistics
[42, 43] or by using neural networks [44, 45, 46]. These denoising studies treat
incompleteness of images as a case of noise and filter it out using statistical
approaches. Image inpainting [47] specifically deals with correcting structured noise in
the image, such as occlusions. In inpainting applications, a mask that pinpoints the
occluded regions in the image is provided by the user (as in the Restricted Boltzmann
Machine study [?]); however, a mask is almost never available in the general computer
vision setting.

On the other hand, there exist several studies that aim at both localization and
correction of occlusions [48, 49, 50, 51, 52, 53]. In these studies, occlusion detection is
performed using domain specific knowledge (visual cues) or external information (object
geometry). However, these sources are not always available in the general data imputation
setting either. Other studies propose solutions that extract occlusion maps using
statistical measures. In [54], HOG based classification errors, and in [55], template based
reconstruction errors are used to generate such an occlusion map. In [56], a Hidden
Markov Model (HMM) framework is utilized to estimate a visibility map that localizes
occlusions.

To relieve the degrading effects of occlusion, part based models are also used:
components are learned by imposing occlusion constraints in [57], descriptors are extracted
from various parts of the occluded object in [58] and, similarly, part-based descriptors
are weighted with the occlusion measure in [59]. Feature matching statistics are learned
in [60], visibility of face parts is learned in a cascade in [?] and sub-matrix matching is
utilized in [61], all of which can also be considered part-based detection approaches.
Learning occlusion patterns has also been investigated. In [62, 63, 64], occlusion patterns
are learned and localized in an object detection framework. [65] proposes a recurrent
localization scheme, in which an object detector hypothesizes about the occlusion using 3D
geometry and top-down feedback corrects the hypothesis.

In general, occlusion localization is a computationally costly operation, and our
approach aims at correction without localization. This is achieved by an iterative process
that imitates a recurrent network and alleviates the degrading effect of missing data on
classification. More importantly, our classification approach is not part-based but
holistic, a setting in which localization of occlusion becomes more cumbersome. However,
it should be noted that occlusion detection algorithms are very effective pre-processing
tools for improving classification performance, and our contribution is orthogonal to the
literature on occlusion detection/localization.


1.4. Contribution

We designed a real-time operable neural network based algorithm that has recurrence
capabilities and is suited to solving the occlusion problem in object classification,
without resorting to user defined occlusion masks or computationally expensive occlusion
detection mechanisms. The proposed algorithm is a standard convolutional network augmented
with feedback mechanisms, which is very fast and capable of classifying occluded objects
with high accuracy. To our knowledge, this is the first neural network study that
systematically examines the effect of occlusion on classification performance in a standard
image dataset.

Figure 2: The general architecture of the network and the algorithmic stages in the training
and test phases. The algorithmic stages are shown as arrows and the data are shown in boxes.
d is the number of channels in the image (e.g. three for RGB), and K is the number of
distinct receptive fields in the network. See text for other details on the algorithmic steps.


2. Recurrent Processing 


2.1. Approach 

Recent work on feedforward convolutional networks has demonstrated the importance of
dense sampling and of the number of hidden layer units [15]. The question is how a
successful feedforward network can be transformed into a pseudo-recurrent one that is not
computationally intensive. In our approach, Coates et al.'s network [15] is adopted and
modified to fill in incomplete (occluded) visual representations (hidden layer activities).
The main recurrent processing principles are applied using low complexity operations. The
nonlinear dynamical equations that construct the attractors in the high dimensional space
are replaced with linear distance comparators, and costly sampling operations such as
Markov Chain Monte Carlo (MCMC) or Gibbs sampling [67] are replaced with averaging and
binary decision operations. In Hopfield networks and Boltzmann machines, the learned
bidirectional network weights are interpretations of the sensory input, and they are formed
during training by iterative energy minimization procedures. In our algorithm, these
memories are formed using K-means clustering and linear filtering.

2.2. Algorithm

In a recurrent network, the hidden layer activity at time $t$ is given as a function
$F_\theta$ (parametrized by $\theta$) of the hidden layer activity $h$ at time $t-1$ and
the current input $x$:

$h^t = F_\theta(h^{t-1}, x^t)$.    (1)

In the leaky integration approach, the hidden layer activity at time $t-1$ is added for
smoother changes,

$h^t = \gamma h^{t-1} + (1-\gamma) F_\theta(h^{t-1}, x^t)$,    (2)

where $\gamma$ is the leakage from the previous hidden layer activity.
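
As a minimal illustration of the leaky integration in Eq. (2), the following Python sketch
uses a `tanh` recurrence as a stand-in for $F_\theta$. The weight matrices `W_h` and `W_x`,
the leakage value and the function names are assumptions made for this example only; they
are not taken from the method described in this paper.

```python
import numpy as np

def recurrence(h_prev, x, W_h, W_x):
    """Stand-in for F_theta: a single tanh recurrence (assumed form)."""
    return np.tanh(W_h @ h_prev + W_x @ x)

def leaky_step(h_prev, x, W_h, W_x, gamma):
    """Eq. (2): h_t = gamma * h_{t-1} + (1 - gamma) * F_theta(h_{t-1}, x_t)."""
    return gamma * h_prev + (1.0 - gamma) * recurrence(h_prev, x, W_h, W_x)

rng = np.random.default_rng(0)
W_h = 0.1 * rng.standard_normal((4, 4))   # hypothetical recurrent weights
W_x = 0.1 * rng.standard_normal((4, 3))   # hypothetical input weights
h = np.zeros(4)
for x in rng.standard_normal((5, 3)):     # five time steps of toy input
    h = leaky_step(h, x, W_h, W_x, gamma=0.8)
```

With `gamma = 1` the state never changes, and with `gamma = 0` the update reduces to the
plain recurrence of Eq. (1); intermediate values smooth the trajectory.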

In our framework, we use the leaky integration approach and, for computational efficiency,
the costly $F_\theta$ recurrence is replaced with $\bar{H}^t$, i.e.,

$h^t = \gamma h^{t-1} + (1-\gamma) \bar{H}^t$,    (3)

where $\bar{H}^t$ is the selected cluster center at time $t$, representing the retrieved
network memory for a specific input image. This memory is retrieved such that the distance
of the previous hidden layer activity to a recorded set of hidden layer activities is
minimized,

$\bar{H}^t = \arg\min_k (h^{t-1} - \bar{H}_y^k)^2$.    (4)

Here, $\bar{H}_y$ is the set of cluster centers for class $y$ (the class decision), with
$K_2$ clusters per class. The closest cluster center is selected (the $\arg\min_k$
operation; see footnote 1) with respect to the class decision $y$ at the output layer. In
contrast to KNN imputation approaches [14], in which the search for the best sample is over
the whole dataset, vector quantization is utilized in our framework to reduce the
computational cost (see also [68] for the use of quantization in Support Vector Machines
(SVMs)). The computation of the closest cluster center $\bar{H}^t$ is based on the class
label decided at the output layer. Therefore, the network uses its class decision to narrow
down the set of candidate probability distributions for sampling hidden layer activity.
Overall, high level class decision information is used to generate hidden layer activity,
which is then merged with the current hidden layer activity. Repeating this procedure in a
loop emulates the behavior of a dynamical system, i.e. an RNN.
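
Eqs. (3) and (4) can be sketched in a few lines of Python. This is only an illustration
under simplifying assumptions: the function names, the dictionary `centers_by_class` and
the toy cluster centers are hypothetical, and the class decision `y` is passed in directly
instead of being produced by a classifier.

```python
import numpy as np

def retrieve_memory(h_prev, centers):
    """Eq. (4): return the cluster center (a vector, not an index)
    closest to h_prev in squared Euclidean distance."""
    d2 = np.sum((centers - h_prev) ** 2, axis=1)
    return centers[np.argmin(d2)]

def pseudo_recurrent_step(h_prev, centers_by_class, y, gamma):
    """Eq. (3): leaky merge of h_prev with the retrieved memory of class y."""
    H_t = retrieve_memory(h_prev, centers_by_class[y])
    return gamma * h_prev + (1.0 - gamma) * H_t

# Toy memory: two classes, two 2-D cluster centers each (made-up values).
centers_by_class = {
    0: np.array([[0.0, 0.0], [1.0, 1.0]]),
    1: np.array([[5.0, 5.0], [6.0, 4.0]]),
}
h = np.array([0.9, 0.7])
h_next = pseudo_recurrent_step(h, centers_by_class, y=0, gamma=0.5)
```

Because only the $K_2$ centers of the decided class are searched, the cost of one step
grows with $K_2$ rather than with the size of the training set.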

The network architecture is shown in the middle row of Figure 2: (1) it has one
convolutional hidden layer (which performs Analysis, or dimensionality expansion), (2) a
subsequent hidden layer is computed by Pooling the first hidden layer's activities in each
quadrant, and (3) a linear Support Vector Machine (SVM) mimics the output layer of
the network and performs multi-class Classification on Layer 2 activity (see [15] for more
details).
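
The Pooling step of this architecture can be sketched as follows. The text does not state
the pooling function, so sum pooling over the four quadrants is an assumption here, as are
the function name and the toy array sizes.

```python
import numpy as np

def quadrant_pool(layer1):
    """Sum-pool a (rows, cols, K) Layer 1 map over its four quadrants
    and return a vector of length 4 * K (sum pooling is an assumption)."""
    rows, cols, _ = layer1.shape
    r, c = rows // 2, cols // 2
    quadrants = [layer1[:r, :c], layer1[:r, c:], layer1[r:, :c], layer1[r:, c:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])

layer1 = np.ones((6, 6, 3))      # toy Layer 1 activity with K = 3 features
layer2 = quadrant_pool(layer1)   # Layer 2 activity, length 4 * K = 12
```

Under this assumption a network with K features yields a Layer 2 vector of length 4K,
which is the representation passed to the linear classifier.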


1 For simplicity of notation throughout the paper, the argmin operation output is not an
index but the selected vector that minimizes the distance between a test vector and a set
of target vectors.

2.2.1. Training

During the training phase (lower row of Figure 2), three stages are introduced for
recurrent processing:

1. Filter: The first and second hidden layer activities of every training input image
are low pass filtered and stored:

$H_1 = \{h_1^1, h_1^2, \ldots, h_1^N\}$, for $N$ training examples and Layer 1,    (5)

$H_2 = \{h_2^1, h_2^2, \ldots, h_2^N\}$, for $N$ training examples and Layer 2.    (6)

2. K-Means Clustering: Memory formation in an RNN through costly energy
minimization is replaced with clustering. The second hidden layer activities ($H_2$) are
vectorized and clustered using K-means, with $K_2$ clusters per class (class decision
denoted by $y$) or $K_2 \times$ (number of classes) clusters for non-class specific
processing (cf. Section 3.4). These stored memories are used for sampling followed
by imputation during test. The hidden Layer 2 memory sets are:

$\bar{H}_2^y = \{\bar{h}_2^1, \bar{h}_2^2, \ldots, \bar{h}_2^{K_2}\}$, $\bar{H}_2 = \{\bar{h}_2^1, \bar{h}_2^2, \ldots\}$,    (7)

where $\bar{H}_2^y$ is the class specific set for class $y$ and $\bar{H}_2$ is the
non-class specific set.

3. Multi-Hypotheses SVM Training: In an RNN, multiple hypotheses can form
and compete with each other to explain sensory data. A multi-hypotheses linear
classification framework is constructed to imitate this feature. The training is repeated
on subsets of the data in order to allow multiple hypotheses of the network. This is
achieved by excluding the data of a specific single class (e.g. Class 1) or a pair of
classes (e.g. Class 1 and Class 2), and training an SVM on the rest of the data. In the
case of single class exclusion, the trained SVM can be used for supplying a second
hypothesis. For example, if Class 1 is the first choice of the network as decided by the
full SVM classifier, the classifier trained by leaving out Class 1 data is used to give a
second hypothesis. In the case of a pair of class exclusions, for example when both Class 1
and Class 2 data are left out, the trained SVM gives a third hypothesis, where the first
choice is Class 1 and the second choice is Class 2. This collection of classifiers is used
during test to decide which cluster centers of hidden Layer 2 activities will be used for
feedback insertion.

$S$ is the SVM classifier for the first choice of the network, $S_p$ is the SVM classifier
for the second choice when the first choice was class $p$, and $S_{pq}$ is the SVM
classifier for the third choice when the first choice was class $p$ and the second was
class $q$.
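
Stages 2 and 3 above can be sketched as follows. The K-means routine is a plain Lloyd's
algorithm written out only to keep the example self-contained (the text does not specify an
implementation), the SVM training itself is omitted, and all function names and the toy
data are hypothetical. The sketch also assumes that one classifier per excluded pair can
serve both orderings of that pair, since the excluded data are the same.

```python
import itertools
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns a (k, dim) array of cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:          # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def build_layer2_memory(H2, labels, k2):
    """Stage 2: class specific memory (k2 centers per class) and the
    non-class specific memory (k2 * number-of-classes centers)."""
    classes = np.unique(labels)
    per_class = {int(c): kmeans(H2[labels == c], k2) for c in classes}
    generic = kmeans(H2, k2 * len(classes))
    return per_class, generic

def exclusion_sets(classes):
    """Stage 3: class subsets left out when training S, S_p and S_pq."""
    singles = [(p,) for p in classes]
    pairs = list(itertools.combinations(classes, 2))
    return [()] + singles + pairs

rng = np.random.default_rng(1)
H2 = np.vstack([rng.normal(m, 0.1, size=(40, 5)) for m in (0.0, 3.0, 6.0)])
labels = np.repeat([0, 1, 2], 40)
per_class, generic = build_layer2_memory(H2, labels, k2=4)
```

Each tuple returned by `exclusion_sets` names the classes whose data would be removed
before fitting one linear classifier; the empty tuple corresponds to the full classifier $S$.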


2.2.2. Test 

The test phase (upper row of Figure 2) has the following stages for recurrent
processing:

1. Pooling: The test phase starts with the algorithm provided by Coates et al. [15] and
computes hidden Layer 2 activity by pooling Layer 1 activity. For test image $i$, at time
$t$:

$h_2^{i,t} = P(h_1^{i,t})$, where $P$ is the pooling operation.    (8)

2. Multi-Hypotheses SVM Testing: The first, second and third class label choices
of the network are extracted using the corresponding linear SVMs (shown as $S(\cdot)$
below) in the classifier memory:

$y_1 = S(h_2^{i,t})$, $y_2 = S_{y_1}(h_2^{i,t})$, $y_3 = S_{y_1 y_2}(h_2^{i,t})$.    (9)


3. Sampling: For each class hypothesis, the cluster center in the hidden Layer
2 memory ($\bar{H}_2^y$) that is closest (in Euclidean distance) to the current hidden
Layer 2 activity is computed. These are the hidden layer hypotheses of the network. Three
cluster centers (one for each class hypothesis) are computed as follows:

$\bar{h}_{2,1}^{i,t} = \arg\min_k (h_2^{i,t} - \bar{H}_2^{y_1,k})^2$,    (10)

$\bar{h}_{2,2}^{i,t} = \arg\min_k (h_2^{i,t} - \bar{H}_2^{y_2,k})^2$,    (11)

$\bar{h}_{2,3}^{i,t} = \arg\min_k (h_2^{i,t} - \bar{H}_2^{y_3,k})^2$.    (12)



4. Competition: In the winner-takes-all configuration, the closest of the cluster
centers computed above is chosen as the Layer 2 hidden activity sample:

$\bar{h}_{2,A}^{i,t} = \arg\min_m (h_2^{i,t} - \bar{h}_{2,m}^{i,t})^2$.    (13)

In the average configuration, the average of the $M$ cluster centers (one for each class
hypothesis) is assigned as the sample:

$\bar{h}_{2,A}^{i,t} = \frac{1}{M} \sum_{m=1}^{M} \bar{h}_{2,m}^{i,t}$.    (14)

In the non-class specific configuration, instead of computing the 3 closest centers, one
for each class hypothesis, the 3 closest cluster centers are computed regardless of the
class hypotheses. The other hidden Layer 2 memory set ($\bar{H}_2$, see above) is used for
the distance computation:

$\bar{h}_{2,A}^{i,t} = \arg\min_k (h_2^{i,t} - \bar{H}_2^k)^2$.    (15)



5. Feedback (Layer 2): The Layer 2 sample is merged (with feedback magnitude $\alpha$)
with the test image Layer 2 activity to generate the hidden layer activity at time $t+1$:

$h_2^{i,t+1} = \frac{1}{1+\alpha} (h_2^{i,t} + \alpha \bar{h}_{2,A}^{i,t})$.    (16)



6. Layer 1 Sampling: The modified hidden Layer 2 activity is used to find the
most similar training set image, using the Euclidean distance:

$L_2^{i,t} = \arg\min_k (h_2^{i,t+1} - H_2^k)^2$, which is the index of the most similar training data.    (17)

The hidden Layer 1 activity of the most similar training data is fetched from the Layer
1 memory as the Layer 1 sample of the network:

$\bar{h}_{1,A}^{i,t} = H_1^{L_2^{i,t}}$.    (18)



7. Feedback (Layer 1): The Layer 1 sample is merged (with feedback magnitude $\beta$)
with the test image Layer 1 activity to generate the hidden layer activity at time $t+1$:

$h_1^{i,t+1} = \frac{1}{1+\beta} (h_1^{i,t} + \beta \bar{h}_{1,A}^{i,t})$.    (19)

8. Pooling (again): The Layer 1 activity is pooled to compute Layer 2 activity. Then,
this activity is averaged (with feedback ratio $\tau$) with the previously computed Layer
2 activity:

$h_2^{i,t+1} = \frac{1}{1+\tau} (h_2^{i,t+1} + \tau P(h_1^{i,t+1}))$.    (20)

This procedure is repeated for multiple iterations starting from the second stage. The
feedback magnitude is halved at each iteration, in the manner of simulated annealing. At
the end of the iterations, $y_1$ is given as the classification output of the algorithm.

Figure 3: Sample images used in the experiments. The original image and versions with 11,
25, 33 and 50 percent occlusion are shown.
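
The Layer 2 part of the test loop (stages 2 to 5, with winner-takes-all competition and
halving of the feedback magnitude) can be sketched as below. The callable `hypotheses` is a
stand-in for the bank of SVMs $S$, $S_p$ and $S_{pq}$ and simply ranks classes by distance
to their nearest center; this ranking rule, the toy memory, the default of three iterations
and all names are assumptions of the sketch rather than details of the method.

```python
import numpy as np

def closest(h, centers):
    """Vector in `centers` closest to h (squared Euclidean distance)."""
    return centers[np.argmin(((centers - h) ** 2).sum(axis=1))]

def layer2_feedback(h2, hypotheses, memory, alpha, n_iter=3):
    """Stages 2-5 with winner-takes-all competition.

    hypotheses(h2) returns ordered class labels [y1, y2, y3]; it stands in
    for the SVM bank. memory[y] holds the cluster centers of class y.
    """
    for _ in range(n_iter):
        labels = hypotheses(h2)                                        # Eq. (9)
        samples = np.array([closest(h2, memory[y]) for y in labels])   # Eqs. (10)-(12)
        winner = closest(h2, samples)                                  # Eq. (13)
        h2 = (h2 + alpha * winner) / (1.0 + alpha)                     # Eq. (16)
        alpha /= 2.0                      # feedback magnitude halved per iteration
    return h2, hypotheses(h2)[0]

# Toy setup: three classes with one made-up center each.
memory = {
    0: np.array([[0.0, 0.0]]),
    1: np.array([[4.0, 0.0]]),
    2: np.array([[0.0, 4.0]]),
}

def hypotheses(h2):
    """Stand-in ranking: classes ordered by distance to their nearest center."""
    dist = {y: ((c - h2) ** 2).sum(axis=1).min() for y, c in memory.items()}
    return sorted(dist, key=dist.get)[:3]

h2_final, y1 = layer2_feedback(np.array([2.5, 0.5]), hypotheses, memory, alpha=1.0)
```

In this toy run the activity is pulled step by step toward the center of class 1, and the
shrinking `alpha` makes each later correction smaller than the previous one.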


3. Experiments 


For performance evaluation, the test batch of the CIFAR 10 dataset [16] is modified to
simulate occlusions (Figure 3). The middle of each image is deleted, i.e., filled in with
zeros, for various area-wise occlusion percentages. The occlusion pattern is assumed to be
unknown. The middle part is occluded in order to make the problem harder: the effect of the
occlusion is more or less distributed across the hidden layer activities due to pooling.
Therefore, we do not present results for other, easier occlusion locations (e.g. the upper
left corner). The occlusion manifests itself more as a distortion in the hidden unit
activities than as missing data. However, due to the uniform nature of the occluder, the
distortion is still localized to a subset of the receptive fields. Fifty thousand training
images and ten thousand test images are used for the experiments. The performance of the
original algorithm [15] on the occluded test images is shown in Figure 4 (left), for three
different numbers of hidden layer units. The accuracy drops to chance level for 50%
occlusion.
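
The occlusion procedure can be sketched as below. The text only states that the middle of
the image is filled with zeros for a given area percentage, so the square shape of the
occluder and the rounding of its side length are assumptions of this example, as is the
function name.

```python
import numpy as np

def occlude_center(image, fraction):
    """Zero a centered square covering roughly `fraction` of the image area."""
    rows, cols = image.shape[:2]
    side_r = int(round(rows * np.sqrt(fraction)))
    side_c = int(round(cols * np.sqrt(fraction)))
    r0, c0 = (rows - side_r) // 2, (cols - side_c) // 2
    occluded = image.copy()
    occluded[r0:r0 + side_r, c0:c0 + side_c] = 0
    return occluded

image = np.ones((32, 32, 3))              # CIFAR 10 images are 32x32 RGB
occluded = occlude_center(image, 0.25)    # 16x16 central block set to zero
```

For percentages whose square root does not give an integer side length (e.g. 11 or 33
percent), the occluded area is only approximately the requested fraction.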

In the first set of experiments (Sections 3.1, 3.2, 3.3 and 3.4), only Layer 2 sampling
and feedback are executed (the Layer 1 stages, from stage 6 onwards, are skipped) to test
the sole effect of higher level feedback into the network.


3.1. Nonlinear vs. Recurrent Processing 

The accuracy improvement due to recurrent processing is shown in Figure 4 (right)
(no occlusion in the training data, but occlusion in the test data). We emphasize that
recurrent processing does not impair performance in the zero occlusion case, and it
improves performance by as much as 2x in the 33% occlusion case. When compared with a
nonlinear SVM (RBF kernel, hidden layer activities as the feature space, grid search for
the best SVM parameters), recurrent processing exceeds the performance of the nonlinear
SVM (for most occlusion cases) at only a fraction of the computational cost at test time:
13 sec for the nonlinear SVM (C++ implementation) vs 1 sec for recurrent processing
(Matlab implementation). A comparable implementation of the algorithm and systematic
timing experiments are planned as future work, but we can still make complexity
comparisons. The computational complexity of the nonlinear SVM is $O(ND)$, where $N$ is the
number of hidden layer neurons and $D$ is the number of support vectors. The complexity of
the proposed algorithm is $O(NK_2)$, where $K_2$ is the number of cluster centers learned
during training. The experiments indicate huge savings: $D$ is orders of magnitude larger
than $K_2$.

Figure 4: Left: The performance of the feedforward network in classifying occluded objects
in the CIFAR 10 dataset, shown separately for different numbers of hidden layer units.
Right: The accuracy improvement when the linear SVM is replaced with an RBF kernel
nonlinear SVM and when pseudo-recurrent processing (only Layer 2 feedback) is applied.

Figure 5: Left: The accuracy of the algorithm in the 33% occlusion case as a function of
the number of clusters per class ($K_2$), with curves for K = 100, 200 and 400. Middle:
Percent accuracy improvement due to recurrent processing as a function of the number of
feedforward-feedback iterations, for different $K_2$ values (25, 50, 100 and 200). Right:
Percent accuracy improvement due to recurrent processing as a function of feedback
magnitude, for different numbers of iterations (1, 2 and 4).

Figure 6: Left: In these experiments, the hidden layer activities of the convolutional
network are used as input to the Restricted Boltzmann Machine (RBM). Gibbs sampling on the
RBM visible units is used instead of the feedback processing algorithm proposed in this
paper. After some number of Gibbs epochs, the regularized convolutional hidden layer
activity is used for classification. Right: The effect of Gibbs sampling on the RBM visible
units for different numbers of Gibbs epochs. The RBM processing does not improve occluded
object classification; moreover, it is detrimental for un-occluded objects.

It is observed that the nonlinear SVM does not outperform the linear SVM in the zero
occlusion case but improves performance for occluded objects. Hence, performance-wise,
recurrent processing upgrades a linear SVM to a nonlinear SVM using a small amount of
computational resources. The theoretical and experimental investigation of the nonlinear
SVM's power in incomplete data classification is outside the scope of this paper and should
definitely be pursued.
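
The $O(ND)$ versus $O(NK_2)$ comparison made in this section can be turned into a rough
operation count. Every number below is illustrative: the support-vector count `D` is a
hypothetical value that is not reported in this paper, `N = 800` assumes a Layer 2 vector of
four quadrants with K = 200 features, and counting three class hypotheses per iteration is
a choice of this sketch.

```python
def distance_ops(n_units, n_refs):
    """Multiply-accumulate count for comparing one vector of n_units
    entries against n_refs reference vectors."""
    return n_units * n_refs

N = 800        # assumed Layer 2 size (4 quadrants x K = 200)
K2 = 50        # cluster centers per class, illustrative value
D = 20000      # hypothetical support-vector count, NOT taken from the paper

svm_cost = distance_ops(N, D)
feedback_cost = distance_ops(N, 3 * K2)   # three class hypotheses per iteration
ratio = svm_cost / feedback_cost
```

Under these assumed values the kernel SVM needs about two orders of magnitude more
distance computations per test image than one feedback iteration; the actual ratio depends
on the real number of support vectors.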


3.2. Restricted Boltzmann Machines Under Unknown Occlusion Mask 

Both in inpainting studies [47, 70] and in Restricted Boltzmann Machine (RBM) experiments
[?], it is assumed that the region of occlusion is known and a mask is given to the
algorithm, after which the missing information is generated through a sampling process.
We experimented with the data generation capability of RBMs when the location of the
occlusion is unknown. In order to make a valid comparison, we used the binarized hidden
layer activities of the convolutional network, instead of the image, as the input to a
classical RBM network (Figure 6, left). In this configuration, the RBM learns the
statistics of the convolutional network activities. We trained a single layer, fully
connected RBM with 800 hidden neurons (K=200, batch size 100, learning rate 0.1, 100
training epochs). The RBM network, as a generative engine, is intended to replace the
recurrent processing proposed in this study by applying an alternating Gibbs sampling
process [67]. In this procedure, the visible layer of the RBM (which receives the hidden
layer activity of the convolutional network) is initialized with the occluded test input,
and Gibbs sampling is expected to correct/regularize the distorted data. We experimented
with different numbers of Gibbs epochs (Figure 6, right) and observed that the RBM is not
able to improve classification performance for occluded objects. Moreover, a serious
performance drop was observed for the non-occluded (original) test set. The failure of the
RBM can be explained by unsubtle changes in the visible units during sampling and large
unavoidable reconstruction errors (around 70, which is almost 10% of the signal).
Consequently, the erratic changes in the visible units during sampling impair more than
what can be gained by the corrections to the corrupted data. This distortion is more
dramatic when there is no occlusion in the image. We conclude that the RBM is not suitable
for occlusion handling when the location of the occlusion is not provided.

Figure 7: Left: Accuracy improvement due to recurrent processing for two different sampling
schemes, winner-takes-all and averaging. Right: Accuracy improvement due to recurrent
processing in non-class specific sampling and class specific feedback (winner-takes-all and
averaging), shown as a function of the number of iterations.
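
For readers unfamiliar with the alternating Gibbs sampling used in this comparison, one
Gibbs epoch of a binary RBM can be sketched as follows. The weights here are random and
untrained, the layer sizes are toy values rather than the configuration reported above, and
the function names are hypothetical; the sketch only shows the mechanics of resampling the
visible units.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_epoch(v, W, b_vis, b_hid, rng):
    """One alternating Gibbs step of a binary RBM:
    sample hidden units given v, then visible units given the hidden sample."""
    p_h = sigmoid(v @ W + b_hid)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_vis)
    return (rng.random(p_v.shape) < p_v).astype(float)

rng = np.random.default_rng(0)
n_vis, n_hid = 16, 8                               # toy sizes, not the paper's
W = 0.01 * rng.standard_normal((n_vis, n_hid))     # untrained, random weights
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
v = (rng.random(n_vis) < 0.5).astype(float)        # binarized input activity
for _ in range(5):                                 # five Gibbs epochs
    v = gibbs_epoch(v, W, b_vis, b_hid, rng)
```

Note that every visible unit is resampled in each epoch, whether or not it was affected by
the occlusion, because no mask is available to clamp the reliable units.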

3.3. Number of Clusters, Iterations and Magnitude 

To analyze the sensitivity of our network to various parameter choices, we conducted
several experiments (for the 33% occlusion case). The number of clusters ($K_2$) is varied,
and the results (Figure 5, left) show that the performance saturates at around 50 cluster
centers per class. The number of feedforward-feedback iterations is examined, and the
results (Figure 5, middle) indicate that 3 iterations are enough to reach the maximum
accuracy improvement. The feedback magnitude ($\alpha$) is cross-varied with the number of
iterations in another set of batch runs (Figure 5, right). It is observed that a large
feedback magnitude with only one iteration can attain the same performance gain as a larger
number of iterations with a lower feedback magnitude, which matters because the number of
iterations is an important parameter for real-time operation. However, a large feedback
magnitude is observed to be more detrimental in the zero occlusion case.

3-4- Number of Hypotheses and Competition 

Accuracy improvement due to recurrent processing is tested for two different sampling 
schemes (for the 33% occlusion case): winner-takes-all and average. In winner-takes-all, the 
cluster center closest to the hidden layer activity of the current image is chosen as the Layer 
2 sample from among the cluster centers of the computed class hypotheses. Accordingly, only 
one cluster wins and is fed back into the hidden layer activity (see [71] for the effect of 
winner-takes-all behavior in neural networks). In the average scheme, the cluster centers 
from the different hypotheses are averaged and fed back into the hidden Layer 2 activity. 
The accuracy improvement results (Figure 0 left) show that the averaging scheme is inferior, 
and that it gets worse as the number of class hypotheses grows. Another sampling scheme uses 
non-class-specific cluster centers that are learned from the whole data. In this scheme, 
the hidden layer activity is assumed to be disconnected from the output layer decisions, 
and feedback is computed generically by finding the closest cluster center in the 
non-class-specific cluster memory. The accuracy improvement results (Figure 0 right) show the 
superiority of class-specific (winner-takes-all) feedback, in terms of both performance and 
speed. Specifically, non-class-specific feedback impairs classification performance as the 
feedforward-feedback iterations proceed. 

Figure 8: Left: Change in accuracy improvement (with respect to the Layer 2 feedback improvement) due to Layer 1 
feedback as a function of Layer 1 feedback magnitude. Right: Change in accuracy improvement due to 
Layer 1 feedback as a function of Layer 1 feedback ratio. 
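
The three sampling schemes compared above can be sketched as follows. This is an illustrative sketch, not the implementation used in the experiments. It assumes Euclidean distance, a dictionary `centers_by_class` mapping each class to its cluster centers, and, for the average scheme, that the closest center of each hypothesized class is the one being averaged. All names are hypothetical.

```python
import numpy as np

def winner_takes_all(hidden, centers_by_class, hypotheses):
    """Pick the single closest cluster center among all hypothesized classes."""
    candidates = np.vstack([centers_by_class[c] for c in hypotheses])
    dists = np.linalg.norm(candidates - hidden, axis=1)
    return candidates[np.argmin(dists)]

def average_sample(hidden, centers_by_class, hypotheses):
    """Average the closest center of each hypothesized class (assumed reading)."""
    picks = [winner_takes_all(hidden, centers_by_class, [c]) for c in hypotheses]
    return np.mean(picks, axis=0)

def non_class_specific(hidden, all_centers):
    """Pick the closest center from a memory learned on the whole data."""
    dists = np.linalg.norm(all_centers - hidden, axis=1)
    return all_centers[np.argmin(dists)]

# Hypothetical two-class memory with two cluster centers per class.
centers_by_class = {
    0: np.array([[0.0, 0.0], [1.0, 0.0]]),
    1: np.array([[5.0, 5.0], [6.0, 5.0]]),
}
hidden = np.array([0.9, 0.2])

wta = winner_takes_all(hidden, centers_by_class, [0, 1])
avg = average_sample(hidden, centers_by_class, [0, 1])
```

In this toy setting the winner-takes-all sample stays at a center of the nearby class, while the averaged sample lands between the two classes. This is one way to see why averaging can degrade as more class hypotheses are included.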

3.5. Recurrent Processing in Hidden Layer 1 

In order to investigate the effect of feedback onto the lower layers of the network, the Layer 
1 sampling procedure (stage 6, explained in Section 2.2.2) is executed and the Layer 1 
sample is fed back into the hidden layer activity of the network (stage 7). It should be noted 
that Layer 1 feedback is a more costly operation (in both memory and processing) than 
Layer 2 feedback, because it requires a large memory of Layer 1 hidden layer activities and 
pairwise distance computations to all training data (stage 6). 

The effect of Layer 1 feedback magnitude and feedback ratio is examined in a 
set of experiments. The results (Figure 8) show that Layer 1 feedback gives a marginal 
accuracy gain compared to Layer 2 feedback. The gain is present especially when the 
Layer 2 feedback magnitude is small, and only in the first iteration, for which there 
is still room for improvement. However, the net effect of Layer 1 feedback is close to 
zero (or negative) for a larger number of iterations. Layer 1 feedback seems to distort the 
hidden layer activities too much, impairing classification performance after the first feedback 
iteration. However, one iteration of combined Layer 1 and Layer 2 feedback gives the 
same performance as multiple iterations of Layer 2 feedback alone. 

Figure 9: The effect of using occluded images in the training data. We added occluded versions of every 
training image (11, 25, 33 and 50 percent occlusion) to the training image set. Left: The classification 
accuracy with respect to occlusion level, for the original training set and the augmented set. Right: The 
percent improvement in classification accuracy with respect to occlusion level for the original training 
set and the augmented set. 

3.6. Training with Occluded Images 

Artificially expanding the data set using image transformations has been successfully 
used for reducing overfitting [7)1, [73], [3]. We tested this approach for improving 
classification performance on occluded objects. The training image set is expanded by including 
occluded versions of the images at all four levels of occlusion (11, 25, 33 and 50 percent). To the best of 
our knowledge, this type of augmentation has not been reported in the literature before. The 
receptive fields and the cluster centers are left unchanged; only the SVM classifiers are 
retrained on the new dataset. The results show a great improvement in classification 
accuracy due to this data augmentation (Figure 9, left). Within this configuration, the feedback 
algorithm still enhances performance, but its impact is reduced (Figure 9, right). 
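
The augmentation step can be sketched as below. This is a minimal illustration that assumes a centered occluder of the same aspect ratio as the image, filled with a constant value, and covering the stated percentage of the image area. The occluder shape, the fill value, and the function names are assumptions made for the sketch, not details taken from the experiments.

```python
import numpy as np

def occlude_center(image, percent, fill=0.0):
    """Cover a centered patch whose area is about `percent` of the image area.

    Assumptions: the patch has the aspect ratio of the image and a constant fill.
    """
    h, w = image.shape[:2]
    scale = np.sqrt(percent / 100.0)
    patch_h = int(round(h * scale))
    patch_w = int(round(w * scale))
    top = (h - patch_h) // 2
    left = (w - patch_w) // 2
    out = image.copy()
    out[top:top + patch_h, left:left + patch_w] = fill
    return out

def augment_with_occlusions(images, labels, levels=(11, 25, 33, 50)):
    """Append occluded copies of every image at each occlusion level."""
    aug_images = list(images)
    aug_labels = list(labels)
    for img, lab in zip(images, labels):
        for p in levels:
            aug_images.append(occlude_center(img, p))
            aug_labels.append(lab)
    return np.stack(aug_images), np.array(aug_labels)

# Hypothetical data: two 32x32 grayscale images of ones.
images = np.ones((2, 32, 32))
labels = np.array([3, 7])
aug_images, aug_labels = augment_with_occlusions(images, labels)
```

With four occlusion levels the training set grows fivefold. In the setting described above, only the classifier would then be refit on features of `aug_images`, while the receptive fields and cluster centers stay fixed.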

4. Analysis 

Correcting corrupted hidden layer activity by imputing the closest cluster center 
improves classification performance significantly, especially for medium-level occlusions. 
The mechanism of this improvement is illustrated in Figure 10. In our toy example, 
the feature dimension and the number of classes are both two, and we assume that each class is 
composed of distinct clusters. This assumption is valid for ecological object classification. 
Visual occlusion distorts a subset of the feature attributes (dark red point) and sweeps 
the data onto another class region (Figure 10, left), which causes incorrect classification. 
However, for some of these distortions, the closest cluster to the new data point still 
belongs to the correct class (Figure 10, middle). If the data point is moved towards the 
closest cluster center, it may land on the correct class region and be saved from incorrect 
classification (Figure 10, right). 

Figure 10: The toy example to illustrate the mechanism of imputation. There are two classes (rectangles 
and circles) and two feature dimensions. The classes are composed of distinct clusters. Left: Visual 
occlusion distorts a subset of the feature attributes (dark red point) and sweeps the data onto another 
class region, which causes incorrect classification. Middle: However, for some of these distortions the 
closest cluster to the new data point still belongs to the correct class. Right: If the data point is moved 
towards the closest cluster center, it may land on the correct class region and be saved from incorrect 
classification. 

By moving the data towards the cluster mean, the log 
likelihood (under a Gaussian assumption) of the correct class (c1) is improved: 


$$\frac{P_{c1}(x)}{P_{c2}(x)} \propto \frac{(x - \mu_{c1})^2}{(x - \mu_{c2})^2}, \qquad (21)$$

where $x$ is the data vector, and $\mu_{c1}$ and $\mu_{c2}$ are the cluster centers. 


This approach exploits the cluster structure of the data and the locality of the adverse effect 
of the occlusion; however, the illustrated correction is not guaranteed. The 
probability of correction drops for severe occlusions (the data moves very far away from 
the correct cluster) and for more uniform distributions of the data (the closest cluster center 
is more likely to belong to the incorrect class). Localizing the occlusion is beneficial 
for estimating the closest cluster center more accurately: once the occlusion 
is localized to a subset of the feature attributes, the 'healthy' attributes give a more 
reliable distance measure to the cluster centers. Therefore, an occlusion localization 
method is essential for further improvement in occluded object classification. 
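
The toy mechanism can be reproduced numerically. The sketch below is illustrative only: it uses one hypothetical cluster center per class, a stand-in linear classifier (not the SVM used in the experiments) whose boundary does not coincide with the midpoint between the centers, and a feedback magnitude of 0.5. All values and names are assumptions.

```python
import numpy as np

# Hypothetical toy setup: one cluster center per class in a 2-D feature space.
CENTERS = np.array([[0.0, 0.0],   # class 0 (the correct class in this example)
                    [4.0, 0.0]])  # class 1
CENTER_LABELS = np.array([0, 1])

def classify(x, threshold=1.5):
    """Stand-in classifier (an assumption): a linear boundary on feature 0."""
    return int(x[0] > threshold)

def impute(x, magnitude=0.5):
    """Move x towards its closest cluster center by the given fraction."""
    dists = np.linalg.norm(CENTERS - x, axis=1)
    nearest = int(np.argmin(dists))
    moved = (1.0 - magnitude) * x + magnitude * CENTERS[nearest]
    return moved, int(CENTER_LABELS[nearest])

clean = np.array([0.5, 0.3])      # unoccluded point of class 0
occluded = np.array([1.8, 0.3])   # feature 0 distorted by a moderate occlusion
severe = np.array([3.0, 0.3])     # feature 0 distorted by a severe occlusion

moved, nearest_label = impute(occluded)
moved_severe, nearest_label_severe = impute(severe)
```

Here the moderately occluded point crosses the classifier boundary but remains closest to the class 0 center, so the imputation step pulls it back to the correct side. The severely occluded point is closest to the class 1 center and is pushed further into the wrong region, which matches the caveat that the correction is not guaranteed.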


5. Computational Complexity 

A comparable implementation of the algorithm and systematic timing experiments are 
planned as future work, but we can still make complexity comparisons. The complexity of 
the proposed algorithm is $O(N K_2)$, where $N$ is the number of hidden layer neurons and $K_2$ 
is the number of cluster centers learned during training. The computational complexity 
of the nonlinear SVM is $O(N D)$, where $D$ is the number of support vectors, whereas the 
computational complexity of Restricted Boltzmann Machines is $O(N^2 K^2)$, in which 
$K$ is the number of hidden layer components. The first observation is that $D$ is orders of 
magnitude larger than $K_2$ for interesting problems in computer vision. Secondly, we can 
infer from the big-O complexity that RBMs are computationally much more demanding than 
our approach. There are GPU implementations of RNN algorithms that can 
be trained and tested on large datasets, but compensation via hardware and software 
speed-ups does not alleviate the inherent complexity of these algorithms. 
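
To give a feel for the orders of magnitude involved, the sketch below evaluates the three leading terms for one set of hypothetical sizes. The sizes are not measurements from the experiments, constant factors are ignored, and the RBM term is read as the square of $N$ times the square of $K$; that reading is an assumption.

```python
def leading_terms(n_hidden, n_clusters, n_support, n_rbm_hidden):
    """Leading-order operation counts, ignoring constant factors."""
    return {
        "feedback": n_hidden * n_clusters,            # O(N K_2)
        "svm": n_hidden * n_support,                  # O(N D)
        "rbm": n_hidden ** 2 * n_rbm_hidden ** 2,     # O(N^2 K^2), assumed reading
    }

# Hypothetical sizes chosen only for illustration.
counts = leading_terms(n_hidden=1000, n_clusters=100,
                       n_support=10000, n_rbm_hidden=100)
```

For these illustrative sizes the feedback term is two orders of magnitude below the SVM term, so the recurrent step would add little to the cost of classification, while the RBM term is several orders of magnitude above both.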

6. Discussion and Future Work 

In our work, we attempted to alleviate the impairment caused by visual occlusions, and 
we adopted a blind approach: the location of the occlusion is unknown. To our knowledge, 
this is the first recurrent neural network study that systematically examines the effect 
of occlusion on classification performance in a standard image dataset and in this blind 
setting. In the recurrent neural network literature, it is assumed that the occlusion is detected 
and localized as a preprocessing step, but in general this step is computationally expensive 
and a gold-standard occlusion localization approach does not exist. We introduced a 
recurrent algorithm that imputes the hidden layer activities of a convolutional neural 
network without resorting to an occlusion detection step. Iteratively merging hidden 
layer activities with samples from a mixture model improved classification accuracy for 
occluded objects by up to 100% (a 2x relative improvement). Selecting a single sample is 
shown to work better than averaging many class hypotheses or using non-class-specific 
selection. Feedback to lower layers has shown diminishing returns. 

In our experiments, there is no variability in the location of the occluder: it is always 
in the center of the image. This is chosen to make the case as hard as possible for the 
algorithm. Yet, it should be noted that the algorithm does not use any information or 
ad hoc procedure to exploit this fixed location. Thus, it is possible to assert 
that the algorithm works in the worst possible scenario of occlusion location. 

We tested the performance of RBMs in this blind setting, and observed that Gibbs 
sampling in the RBM visible units is not capable of restoring distorted feature attributes. 
Hence, energy based recurrent networks seem to require a localization stage in order to 
accurately classify occluded objects. 

Moreover, we experimented with training the neural network on occluded images, 
which is ecologically more natural. It is observed that the performance is enhanced 
significantly when some of the training images also suffer from occlusions. To the best 
of our knowledge, this type of data augmentation has not been reported in the literature, and 
this observation is novel. 

As future work, a second convolutional layer can be introduced after Layer 1 of the 
proposed network, using unsupervised methods to determine the local receptive fields 
[75]. In that case, it is possible to devise more plausible inter-layer feedback from the second 
convolutional layer to the first. This will extend the scope of pseudo-recurrent processing 
and will allow more complicated multi-layer feedforward-feedback loops. Also, detection 
and localization of occlusion seem beneficial, especially for hidden Layer 1 feedback, and 
devising a fast and effective occlusion detection method needs further development. Moreover, 
the nonlinear SVM's power for incomplete data classification and data augmentation 
with occluded images seem very promising and need further investigation. 